Parallel Matrix Transpose Algorithms on Distributed Memory Concurrent Computers

Authors

  • Jaeyoung Choi
  • Jack J. Dongarra
  • David W. Walker
Abstract

This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed in LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package [5]. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
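The LCM/GCD step count above can be illustrated with a short, self-contained sketch. The snippet below is a minimal model, not the paper's implementation: it assumes the usual block scattered (block-cyclic) assignment, in which global block (I, J) resides on processor (I mod P, J mod Q) of the P × Q template, and the helper names owner and transpose_partners are invented for illustration. It enumerates, for each processor, the set of processors that must receive at least one of its blocks during a transpose, and checks that this set has exactly LCM/GCD elements.

```python
from math import gcd

def owner(I, J, P, Q):
    # Block scattered (block-cyclic) assignment: global block (I, J)
    # resides on processor (I mod P, J mod Q) of the P x Q template.
    return (I % P, J % Q)

def transpose_partners(P, Q, nblocks):
    # Block (I, J) of A becomes block (J, I) of A^T, so the processor
    # owning (I, J) must send it to the processor owning (J, I).
    partners = {(p, q): set() for p in range(P) for q in range(Q)}
    for I in range(nblocks):
        for J in range(nblocks):
            partners[owner(I, J, P, Q)].add(owner(J, I, P, Q))
    return partners

if __name__ == "__main__":
    for P, Q in [(2, 3), (2, 4), (4, 6)]:
        g = gcd(P, Q)
        lcm = P * Q // g
        # lcm block rows/columns suffice to cover every residue pattern.
        partners = transpose_partners(P, Q, nblocks=lcm)
        counts = {len(s) for s in partners.values()}
        print(f"P={P}, Q={Q}: GCD={g}, LCM={lcm}, "
              f"partners per processor = {counts}, expected {lcm // g}")
```

When P and Q are relatively prime (GCD = 1), the partner set covers all P·Q processors, which is the complete exchange mentioned above; a larger GCD shrinks each processor's partner set proportionally.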

Similar articles

PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers

This paper describes the Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non-transposed matrix multiplication routine C = A · B, but also transposed multiplication routines C = A^T · B, C = A · B^T, and C = A^T · B^T, for a block scattered data distribution. The routines perform efficiently for a wide ...
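A quick way to see how a transpose routine combines with a non-transposed multiply is the identity A^T · B^T = (B · A)^T: multiply in the opposite order, then transpose the result. The NumPy check below is purely illustrative, not the PUMMA code:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))   # A is k x m, so A^T is m x k
B = rng.random((4, 5))   # B is n x k, so B^T is k x n

# Direct evaluation of C = A^T B^T (an m x n result) ...
C_direct = A.T @ B.T
# ... versus multiply-then-transpose, using A^T B^T = (B A)^T.
C_via_transpose = (B @ A).T

assert np.allclose(C_direct, C_via_transpose)
print("A^T B^T == (B A)^T verified, shape:", C_direct.shape)
```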

Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors

An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures running in a hybrid OpenMP/MPI configuration is presented. Significant boosts in speed are observed relative to the distributed transpose used in the state-of-the-art adaptive FFTW library. In some cases, a hybrid configuration allows one to reduce communication costs by reducing the number of MPI ...

High-performance and Parallel Inversion of a Symmetric Positive Definite Matrix

We present families of algorithms for operations related to the computation of the inverse of a Symmetric Positive Definite (SPD) matrix: Cholesky factorization, inversion of a triangular matrix, multiplication of a triangular matrix by its transpose, and one-sweep inversion of an SPD matrix. These algorithms are systematically derived and implemented via the Formal Linear Algebra Methodology E...

Load Balancing Problem for Parallel Computers with Distributed Memory

This paper deals with load balancing of parallel algorithms for distributed-memory computers. The parallel versions of BLAS subroutines for the matrix-vector product and LU factorization are considered. Two task partitioning algorithms are investigated and speed-ups are calculated. The cases of homogeneous and heterogeneous collections of computers/processors are studied, and special partitioning al...

Software Libraries for Linear Algebra Computations on High Performance Computers

This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under develo...

Journal:
  • Parallel Computing

Volume 21, Issue -

Pages -

Publication date: 1995